\documentclass{standalone}
\usepackage{placeins}
\begin{document}

    \section{Data Manipulation}

    This section covers conversions between Spark and H2O Frames and also ways how H2O Frames can be created directly.

    \subsection{Creating H2O Frames}

    This section covers multiple ways how a H2O Frame can be created.

    \subsubsection{Convert from RDD, DataFrame or Dataset}

    H2OFrame can be created by converting Spark entities into it. \textbf{H2OContext} provides the
    \textbf{asH2OFrame} method which accepts Spark Dataset, DataFrame and RDD[T] as an input parameter.
    In the case of RDD, the type \texttt{T} has to satisfy the upper bound expressed by the type \texttt{Product}.
    The conversion will create a new \texttt{H2OFrame}, transfer data from the specified RDD, and save it to the
    H2O K/V data store.

    Example of converting Spark DataFrame into H2OFrame is:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
h2oContext.asH2OFrame(sparkDf)
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
h2oContext.asH2OFrame(sparkDf)
        \end{lstlisting}
        \item \textbf{R} \begin{lstlisting}[style=R]
h2oContext$asH2OFrame(sparkDf)
        \end{lstlisting}
    \end{itemize}

    You can also specify the name of the resulting H2OFrame as

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
h2oContext.asH2OFrame(sparkDf, "frameName")
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
h2oContext.asH2OFrame(sparkDf, "frameName")
        \end{lstlisting}
        \item \textbf{R} \begin{lstlisting}[style=R]
h2oContext$asH2OFrame(sparkDf, "frameName")
        \end{lstlisting}
    \end{itemize}

    The method is overloaded and accepts also RDDs and Datasets as inputs.

    \subsubsection{Creating H2OFrame from an Existing Key}

    If the H2O cluster already contains a loaded \texttt{H2OFrame} referenced by key, for example, \texttt{train.hex}, it
    is possible to reference it from Sparkling Water by creating a proxy \texttt{H2OFrame} instance using the key as the input:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
import ai.h2o.sparkling.H2OFrame
val frame = H2OFrame("train.hex")
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
import h2o
val frame = h2o.get_frame("train.hex")
        \end{lstlisting}
        \item \textbf{R} \begin{lstlisting}[style=R]
library(h2o)
val frame = h2o.getFrame("train.hex")
        \end{lstlisting}
    \end{itemize}

    \subsubsection{Create H2O Frame Directly}

    H2O Frame can also be created directly. To see how you can create H2O Frame directly in Python and R, please see
    H2O documentation at \url{https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging.html}.

    To create H2O Frame directly in Scala, you can:

    \begin{itemize}
        \item load a cluster local file (a file located on each node of the cluster):
        \begin{lstlisting}[style=Scala]
val h2oFrame = H2OFrame(new File("/data/iris.csv"))
        \end{lstlisting}
        \item load file from HDFS/S3/S3N/S3A:
        \begin{lstlisting}[style=Scala]
val h2oFrame = H2OFrame(URI.create("hdfs://data/iris.csv"))
        \end{lstlisting}
    \end{itemize}

    \subsection{Converting H2O Frames to Spark entities}

    This section covers how we can convert H2O Frame to Spark entities.

    \subsubsection{Convert to RDD}

    Conversion to RDD is only available in the Scala API of Sparkling Water

    The \textbf{H2OContext} class provides the method \textbf{asRDD}, which creates an RDD-like wrapper around the
    provided \textbf{H2OFrame}.

    The method expects the type \texttt{T} to create a correctly-typed RDD. The conversion requires type \texttt{T} to
    be bound by \texttt{Product} interface. The relationship between the columns of \texttt{H2OFrame} and the
    attributes of class \texttt{A} is based on name matching.

    Example of converting H2OFrame into Spark RDD is:

    \begin{lstlisting}[style=Scala]
val df: H2OFrame = ...
h2oContext.asRDD[Weather](df)
    \end{lstlisting}

    \subsubsection{Convert to DataFrame}

    The \textbf{H2OContext} class provides the method \textbf{asSparkFrame}, which creates a \texttt{DataFrame}-like
    wrapper around the provided \texttt{H2OFrame}.

    The schema of the created instance of the \texttt{DataFrame} is derived from the column names and the types of
    the specified \texttt{H2OFrame}.

    Example of converting H2OFrame into Spark DataFrame is:
    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
h2oContext.asSparkFrame(h2oFrame)
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
h2oContext.asSparkFrame(h2oFrame)
        \end{lstlisting}
        \item \textbf{R} \begin{lstlisting}[style=R]
h2oContext$asSparkFrame(h2oFrame)
        \end{lstlisting}
    \end{itemize}

    \subsection{Mapping between H2OFrame And Data Frame Types}

    For all primitive types or Spark SQL types (see \linebreak \texttt{org.apache.spark.sql.types}) which
    can be part of Spark RDD/DataFrame/Dataset, we provide mapping into H2O vector types (numeric,
    categorical, string, time, UUID - see \texttt{water.fvec.Vec}):

    \begin{table}[!ht]
        \centering
        \begin{tabular}{l l l}
            \toprule
            Primitive type     & SQL type   & H2O type \\
            \midrule
            NA                 & BinaryType    & Numeric  \\
            Byte               & ByteType      & Numeric  \\
            Short              & ShortType     & Numeric  \\
            Integer            & IntegerType   & Numeric  \\
            Long               & LongType      & Numeric  \\
            Float              & FloatType     & Numeric  \\
            Double             & DoubleType    & Numeric  \\
            String             & StringType    & String   \\
            Boolean            & BooleanType   & Numeric  \\
            java.sql.TimeStamp & TimestampType & Time     \\
            \bottomrule
        \end{tabular}
    \end{table}

    \newpage
    \subsection{Mapping between H2OFrame and RDD[T] Types}

    For converting rdd to H2OFrame and vice-versa, as type \texttt{T} in RDD[T] we support following types:

    \begin{table}[!ht]
        \centering
        \begin{tabular}{l}
            \toprule
            \textbf{T}                                       \\
            \midrule
            NA                                               \\
            Byte                                             \\
            Short                                            \\
            Integer                                          \\
            Long                                             \\
            Float                                            \\
            Double                                           \\
            String                                           \\
            Boolean                                          \\
            java.sql.Timestamp                               \\
            Any scala class extending scala \texttt{Product} \\
            org.apache.spark.mllib.regression.LabeledPoint   \\
            org.apache.spark.ml.linalg.Vector                \\
            org.apache.spark.mllib.linalg                    \\
            \bottomrule
        \end{tabular}
    \end{table}

    \subsection{Using Spark Data Sources with H2OFrame}
    Spark SQL provides a configurable data source for SQL tables. Sparkling Water enables \texttt{H2OFrame}s to be
    used as a data source to load/save data from/to Spark DataFrame. This API is supported only in Scala and
    Python APIs

    \subsubsection{Reading from \texttt{H2OFrame}}

    To read H2OFrame as Spark DataFrame, run:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
spark.read.format("h2o").load(frameKey)
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
spark.read.format("h2o").load(frameKey)
        \end{lstlisting}
    \end{itemize}

    You can also specify the key as option:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
spark.read.format("h2o").option("key", frameKey).load()
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
spark.read.format("h2o").option("key", frameKey).load()
        \end{lstlisting}
    \end{itemize}

    If you specify the key as the option and inside the load method, the option
    has the precedence.

    \subsubsection{Saving to \texttt{H2OFrame}}

    To save DataFrame into a H2OFrame, run:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
df.write.format("h2o").save("new_key")
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
df.write.format("h2o").save("new_key")
        \end{lstlisting}
    \end{itemize}

    You can also specify the key as option:

    \begin{itemize}
        \item \textbf{Scala} \begin{lstlisting}[style=Scala]
df.write.format("h2o").option("key", "new_key").save()
        \end{lstlisting}
        \item \textbf{Python} \begin{lstlisting}[style=Python]
df.write.format("h2o").option("key", "new_key").save()
        \end{lstlisting}
    \end{itemize}

    If you specify the key as the option and inside the save method, the option
    has the precedence.

    \subsubsection{Specifying Saving Mode}

    There are four save modes available when saving data using Data Source
    API - see \url{http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes}

    \begin{itemize}
        \item If \textbf{append} mode is used, an existing \texttt{H2OFrame} with the same key is deleted, and a new one created with the same key. The new frame contains the union of all rows from the original \texttt{H2OFrame} and the appended \texttt{DataFrame}.
        \item If \textbf{overwrite} mode is used, an existing \texttt{H2OFrame} with the same key is deleted, and a new one with the new rows is created with the same key.
        \item If \textbf{error} mode is used, and a \texttt{H2OFrame} with the specified key already exists, an exception is thrown.
        \item If \textbf{ignore} mode is used, and a \texttt{H2OFrame} with the specified key already exists, no data are changed.
    \end{itemize}

\end{document}
